Starfish: A Self-tuning System for Big Data Analytics

Authors

  • Herodotos Herodotou
  • Harold Lim
  • Gang Luo
  • Nedyalko Borisov
  • Liang Dong
  • Fatma Bilgen Cetin
  • Shivnath Babu

Abstract

Modern industrial, government, and academic organizations are collecting massive amounts of data (“big data”) at an unprecedented scale and pace. The ability to perform timely and cost-effective analytical processing of such large datasets to extract deep insights is now a key ingredient for success. These insights can drive automated processes for advertisement placement, improve customer relationship management, and lead to major scientific breakthroughs. Existing database systems are adapting to the new status quo while large-scale dataflow systems (like Dryad and MapReduce) are becoming popular for analytical workloads on big data. My research interests are in ease-of-use, manageability, and automated tuning of such large-scale data processing systems. Ensuring good and robust system performance poses several new challenges. First, workloads are now analyzing big data consisting of a hybrid mix of structured and unstructured datasets stored in nontraditional data layouts. The structure and properties of the data may not be known initially, and will evolve over time. Complex analysis techniques and rapid development needs often require the use of both declarative and procedural programming languages. Finally, the space of tuning choices is extremely high-dimensional, with choices ranging from various workload configuration settings to cluster provisioning and data layouts.
My research work involves (1) exploring novel optimization opportunities in the MapReduce platform that range from the job level to the workload level, while considering factors like scheduling, data layouts and provisioning; (2) exploiting new data layouts and partitioning for improving performance and system manageability in both database and dataflow systems; (3) introducing a SQL-tuning-aware query optimizer that is capable of improving on current query plans by executing some subplans proactively, collecting monitoring data from the runs, and iterating; and (4) using database-style optimization strategies to enable I/O-efficient statistical computing.

Similar resources

MapReduce Programming and Cost-based Optimization? Crossing this Chasm with Starfish

MapReduce has emerged as a viable competitor to database systems in big data analytics. MapReduce programs are being written for a wide variety of application domains including business data processing, text analysis, natural language processing, Web graph and social network analysis, and computational science. However, MapReduce systems lack a feature that has been key to the historical succes...

Big Data Analytics and Now-casting: A Comprehensive Model for Eventuality of Forecasting and Predictive Policies of Policy-making Institutions

The ability of now-casting and eventuality is the most crucial and vital achievement of big data analytics in the area of policy-making. To recognize the trends and to render a real image of the current condition and alarming immediate indicators, the significance and the specific positions of big data in policy-making are undeniable. Moreover, the requirement for policy-making institutions to ...

PStorM: Profile Storage and Matching for Feedback-Based Tuning of MapReduce Jobs

The MapReduce programming model has become widely adopted for large-scale analytics on big data. MapReduce systems such as Hadoop have many tuning parameters, many of which have a significant impact on performance. The map and reduce functions that make up a MapReduce job are developed using arbitrary programming constructs, which makes them black-box in nature and therefore renders it difficult...

A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection

Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....

Application of Big Data Analytics in Power Distribution Network

Smart grid enhances optimization in generation, distribution and consumption of the electricity by integrating information and communication technologies into the grid. Today, utilities are moving towards smart grid applications, most common one being deployment of smart meters in advanced metering infrastructure, and the first technical challenge they face is the huge volume of data generated ...

Publication date: 2011